Logistic regression
Regression analysis, Softmax regression
In logistic regression, the dependent variable is categorical (possibly ordinal) and we would like to estimate the probability of an outcome. Instead of modeling the probability directly, we can model its log odds. The log odds is a convenient quantity because it ranges from $-\infty$ to $\infty$, whereas a probability is confined to $[0, 1]$. The function that maps a probability to its log odds is called the logit function.
For a probability $p$, the odds are defined as $p / (1 - p)$. The log odds approaches $-\infty$ as $p \to 0$ and approaches $\infty$ as $p \to 1$.
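Let
$$\operatorname{logit}(p) = \log \frac{p}{1 - p}.$$
Now we can formulate a model where
$$\operatorname{logit}(p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = \mathbf{x}^\top \boldsymbol{\beta}$$
or
$$p = \frac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\beta}}}.$$
This is the logistic regression model. We can think of it as an application of linear regression on the logit of the probability, or as applying a logistic transformation to the linear predictor $\mathbf{x}^\top \boldsymbol{\beta}$ to obtain a bounded probability value. The logit and the logistic function are inverses of each other, and the logit function is called a “link function” in the generalized linear model (GLM) framework.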
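The relationship between the logit and the logistic function can be checked numerically. The following is a minimal sketch using only the standard library; the coefficient and covariate values are made up for illustration and do not come from any fitted model.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and covariates, chosen only for illustration.
beta = [-1.0, 0.8, 0.5]  # intercept, beta_1, beta_2
x = [1.0, 2.0, -1.0]     # the leading 1.0 multiplies the intercept

z = sum(b * xi for b, xi in zip(beta, x))  # linear predictor
p = logistic(z)                            # bounded probability

# Applying the logit to p recovers the linear predictor.
z_recovered = logit(p)
```

Here `z` is unbounded while `p` always lies strictly between 0 and 1, and `logit(p)` returns the original linear predictor up to floating-point error.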
Interpretation
Marginal effects and odds ratio
See also Marginal effects
The odds ratio captures the multiplicative change in the odds of the outcome upon a change in an independent variable; by contrast, the marginal effect captures the additive change in the probability of the outcome.
Because the logit function is the logarithm of the odds, the odds are $e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}$. This means that if we calculate the ratio between the odds with $x_i + 1$ and the odds with $x_i$, holding the other variables fixed, we obtain $e^{\beta_i}$. In other words, if we simply exponentiate a coefficient of the model, it gives us the odds ratio for a unit change in the corresponding variable. Moreover, the odds ratio is constant regardless of the values of the independent variables.
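The constancy of the odds ratio can be verified with a few lines of arithmetic. This is a sketch with hypothetical coefficients (not estimates from real data): the ratio of odds for a unit increase in the first covariate is evaluated at several arbitrary covariate values.

```python
import math

# Hypothetical coefficients: intercept, beta_1, beta_2.
beta = [-1.0, 0.8, 0.5]

def odds(x):
    """Odds p / (1 - p) implied by the model: exp of the linear predictor."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))

def odds_ratio_for_x1(x1, x2):
    """Odds at x1 + 1 divided by odds at x1, holding x2 fixed."""
    return odds([1.0, x1 + 1.0, x2]) / odds([1.0, x1, x2])

# Evaluate the ratio at a few arbitrary points in covariate space.
points = [(0.0, 0.0), (2.0, -1.0), (-3.0, 5.0)]
ratios = [odds_ratio_for_x1(x1, x2) for x1, x2 in points]

expected = math.exp(beta[1])  # exponentiated coefficient of x1
```

Every entry of `ratios` equals `expected` up to floating-point error, no matter where the ratio is evaluated.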
By contrast, the marginal effect is how much the probability (dependent variable) changes when we change an independent variable. Because it is about the probability, not the odds, it describes an additive change and, unlike the odds ratio, varies depending on the values of the other variables. For a binary variable $x_i$, the marginal effect is the change in probability when the variable goes from 0 to 1 with the other variables held fixed. Specifically, $P(y = 1 \mid x_i = 1) - P(y = 1 \mid x_i = 0)$.
For continuous variables, it is the instantaneous rate of change (“dy/dx”).
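To see how marginal effects depend on where they are evaluated, here is a small sketch with hypothetical coefficients, where `x1` is treated as continuous and `x2` as binary. For the continuous case it uses the standard derivative of the logistic model, $\partial p / \partial x_1 = \beta_1 \, p (1 - p)$.

```python
import math

# Hypothetical coefficients: intercept, continuous x1, binary x2.
beta = [-1.0, 0.8, 0.5]

def prob(x1, x2):
    """P(y = 1) under the logistic model for the given covariates."""
    z = beta[0] + beta[1] * x1 + beta[2] * x2
    return 1.0 / (1.0 + math.exp(-z))

def binary_effect(x1):
    """Change in probability when x2 goes from 0 to 1, at a given x1."""
    return prob(x1, 1.0) - prob(x1, 0.0)

def continuous_effect(x1, x2):
    """Instantaneous rate of change dp/dx1 = beta_1 * p * (1 - p)."""
    p = prob(x1, x2)
    return beta[1] * p * (1.0 - p)

# The same variable has a different marginal effect at different x1 values.
effects_binary = [binary_effect(x1) for x1 in (-2.0, 0.0, 2.0)]
effects_continuous = [continuous_effect(x1, 0.0) for x1 in (-2.0, 0.0, 2.0)]
```

Unlike the odds ratio, the entries of `effects_binary` and `effects_continuous` differ from one evaluation point to the next, which is why marginal effects are usually reported at specific values or averaged over the sample.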
Issues
- If you want to ruin your trust in the interpretation of multivariable logistic regression coefficients, read about
- Post selection inference
- finite sample bias
- non-collapsibility
- table 2 fallacy
- colliders and intermediates
- separation
- model misspecification
Tutorials
Tools
Statsmodels
- Logistic Regression in Python Using Rodeo
- Machine Learning for Hackers Chapter 2, Part 2: Logistic regression with statsmodels
common usage patterns
Using an R-like formula. It takes care of categorical dummy variables and lets you apply transformations (e.g. log) on the fly.
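import numpy as np  # needed so that np.log resolves inside the formula
import statsmodels.formula.api as smf

# df is assumed to be a pandas DataFrame with columns DV, x1, ..., x5.
# x4*x5 expands to x4 + x5 + x4:x5 (main effects plus interaction).
result = smf.logit('DV ~ x1 + x2 + np.log(x3) + x4*x5', data=df).fit()
result.summary()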
Calculating the odds ratio with 95% CI.
conf = result.conf_int()
conf['odds_ratio'] = result.params
conf.columns = ['2.5%', '97.5%', 'odds_ratio']
np.exp(conf)
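The same quantity can be worked out by hand for a single coefficient. The sketch below assumes a normal-approximation (Wald) interval on the coefficient scale, which is then exponentiated; the coefficient and standard error are hypothetical numbers, not output from a real fit.

```python
import math
from statistics import NormalDist

coef = 0.8  # hypothetical estimated coefficient (log odds ratio)
se = 0.25   # hypothetical standard error of that coefficient

# Critical value for a two-sided 95% interval (about 1.96).
z = NormalDist().inv_cdf(0.975)

# Interval on the log-odds scale, then exponentiate to the odds-ratio scale.
lower, upper = coef - z * se, coef + z * se
odds_ratio = math.exp(coef)
or_ci = (math.exp(lower), math.exp(upper))
```

Because the exponential is monotone, exponentiating the endpoints keeps the coverage of the interval; note that the resulting interval is not symmetric around the odds ratio.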
F-test with human-readable restriction formula
result.f_test('x1 = x2 = 0')
result.f_test('x1 = x2')
Get average marginal effects (use the `at` parameter for other marginal effects).
margins = result.get_margeff() # marginal effects
margins.summary()
margins.summary_frame() # get a data frame
Selecting a reference (pivot) dummy
By default, statsmodels picks the reference dummy using alphabetical order. When formulas are parsed by patsy, the reference can also be set explicitly inside the formula, e.g. `C(gender, Treatment(reference='m'))`. A quick alternative is to rename the level that we want as the reference so that it sorts first. For instance, if we have a `gender` column with `m` and `f` values but we want to have `m` as the pivot, then simply replace it with `a`.
df['gender'] = df['gender'].replace('m', 'a')  # assign back instead of relying on inplace=True
We can use a similar trick for the multinomial logit.